Character Stream Parsing of Mixed-lingual Text

Authors

  • Harald Romsdorfer
  • Beat Pfister
Abstract

In multilingual countries, text-to-speech synthesis systems often have to deal with sentences that contain inclusions from several other languages in the form of phrases, words, or even parts of words. Such sentences can only be processed correctly by a system that incorporates a mixed-lingual morphological and syntactic analyzer. A prerequisite for such an analyzer is the correct identification of word and sentence boundaries. Traditional text analysis handles both problems with simple heuristic methods in a text preprocessing step. These methods, however, are not reliable enough for analyzing mixed-lingual sentences. This paper presents a new approach to word and sentence boundary identification for mixed-lingual sentences that is based on parsing character streams. This approach can also be used for word identification in languages without a designated word boundary symbol, such as Chinese or Japanese. To date, this mixed-lingual text analysis supports any mixture of English, French, German, Italian and Spanish.
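The abstract does not spell out the parsing method, so the following is only a toy illustration of the general idea of finding word boundaries directly in a character stream, not the authors' analyzer. It assumes tiny hypothetical per-language word lists (`LEXICON`) and a hypothetical helper `segment` that uses dynamic programming to pick the split with the fewest words, tagging each word with the language whose list contains it.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy lexicons (assumption): a real mixed-lingual analyzer would
# rely on morphological and syntactic grammars rather than flat word lists.
LEXICON: Dict[str, Set[str]] = {
    "de": {"das", "ist", "ein"},
    "en": {"update", "up", "date"},
}

Token = Tuple[str, str]  # (word, language code)


def segment(stream: str,
            lexicon: Dict[str, Set[str]] = LEXICON) -> Optional[List[Token]]:
    """Split a delimiter-free character stream into (word, language) pairs.

    Returns the segmentation with the fewest words, or None if the stream
    cannot be covered completely by lexicon entries.
    """
    n = len(stream)
    # best[i] holds the shortest known segmentation of stream[:i].
    best: List[Optional[List[Token]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            if prefix is None:
                continue  # stream[:start] cannot be segmented
            piece = stream[start:end]
            for lang in sorted(lexicon):
                if piece in lexicon[lang]:
                    candidate = prefix + [(piece, lang)]
                    current = best[end]
                    if current is None or len(candidate) < len(current):
                        best[end] = candidate
    return best[n]


if __name__ == "__main__":
    print(segment("dasisteinupdate"))
```

Preferring the fewest words is just one simple disambiguation rule chosen for this sketch; it keeps "update" as a single English word instead of splitting it into "up" and "date".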

Similar resources

Mixed-lingual text analysis for polyglot TTS synthesis

Text-to-speech (TTS) synthesis is more and more confronted with the language mixing phenomenon. An important step towards the solution of this problem and thus towards a so-called polyglot TTS system is an analysis component for mixed-lingual texts. In this paper it is shown how such an analyzer can be realized for a set of languages, starting from a corresponding set of monolingual analyzers whi...

One Tree is not Enough: Cross-lingual Accumulative Structure Transfer for Semantic Indeterminacy

We address the task of parsing semantically indeterminate expressions, for which several correct structures exist that do not lead to differences in meaning. We present a novel non-deterministic structure transfer method that accumulates all structural information based on cross-lingual word distance derived from parallel corpora. Our system’s output is a ranked list of trees. To evaluate our s...

Cross-lingual Transfer for Unsupervised Dependency Parsing Without Parallel Data

Cross-lingual transfer has been shown to produce good results for dependency parsing of resource-poor languages. Although this avoids the need for a target language treebank, most approaches have still used large parallel corpora. However, parallel data is scarce for low-resource languages, and we report a new method that does not need parallel data. Our method learns syntactic word embeddings ...

A Cross-Lingual Induction Technique for German Adverbial Participles

We provide a detailed comparison of strategies for implementing medium-to-low frequency phenomena such as German adverbial participles in a broad-coverage, rule-based parsing system. We show that allowing for general adverb conversion of participles in the German LFG grammar seriously affects its overall performance, due to increased spurious ambiguity. As a solution, we present a corpus-based cr...

Tübingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing

This paper describes our systems and results on VarDial 2017 shared tasks. Besides three language/dialect discrimination tasks, we also participated in the cross-lingual dependency parsing (CLP) task using a simple methodology which we also briefly describe in this paper. For all the discrimination tasks, we used linear SVMs with character and word features. The system achieves competitive resu...

Publication date: 2006